{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_5_weights.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# T81-558: Applications of Deep Neural Networks\n",
    "**Module 3: Introduction to TensorFlow**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 3 Material\n",
    "\n",
    "* Part 3.1: Deep Learning and Neural Network Introduction [[Video]](https://www.youtube.com/watch?v=zYnI4iWRmpc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_1_neural_net.ipynb)\n",
    "* Part 3.2: Introduction to Tensorflow and Keras [[Video]](https://www.youtube.com/watch?v=PsE73jk55cE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_2_keras.ipynb)\n",
    "* Part 3.3: Saving and Loading a Keras Neural Network [[Video]](https://www.youtube.com/watch?v=-9QfbGM1qGw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_3_save_load.ipynb)\n",
    "* Part 3.4: Early Stopping in Keras to Prevent Overfitting [[Video]](https://www.youtube.com/watch?v=m1LNunuI2fk&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_4_early_stop.ipynb)\n",
    "* **Part 3.5: Extracting Weights and Manual Calculation** [[Video]](https://www.youtube.com/watch?v=7PWgx16kH8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_03_5_weights.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running the correct version of TensorFlow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    %tensorflow_version 2.x\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3.5: Extracting Weights and Manual Network Calculation\n",
    "\n",
    "### Weight Initialization\n",
    "\n",
    "The weights of a neural network determine the output for the neural network. The training process can adjust these weights, so the neural network produces useful output. Most neural network training algorithms begin by initializing the weights to a random state. Training then progresses through iterations that continuously improve the weights to produce better output.\n",
    "\n",
    "The random weights of a neural network impact how well that neural network can be trained. If a neural network fails to train, you can remedy the problem by simply restarting with a new set of random weights. However, this solution can be frustrating when you are experimenting with the architecture of a neural network and trying different combinations of hidden layers and neurons. If you add a new layer, and the network’s performance improves, you must ask yourself if this improvement resulted from the new layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight initialization algorithm:\n",
    "\n",
    "* How consistently does this algorithm provide good weights?\n",
    "* How much of an advantage do the weights of the algorithm provide?\n",
    "\n",
    "One of the most common yet least practical approaches to weight initialization is to set the weights to random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If you want to ensure that you get the same set of random weights each time, you should use a seed. The seed specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always get these values when you choose a seed of 1000. \n",
    "Not all seeds are created equal. One problem with random weight initialization is that the random weights created by some seeds are much more difficult to train than others. The weights can be so bad that training is impossible. If you cannot train a neural network with a particular weight set, you should generate a new set of weights using a different seed.\n",
    "\n",
    "Because weight initialization is a problem, considerable research has been around it. By default, Keras uses the Xavier weight initialization algorithm, introduced in 2006 by Glorot & Bengio[[Cite:glorot2010understanding]](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), produces good weights with reasonable consistency. This relatively simple algorithm uses normally distributed random numbers.  \n",
    "\n",
    "To use the Xavier weight initialization, it is necessary to understand that normally distributed random numbers are not the typical random numbers between 0 and 1 that most programming languages generate. Normally distributed random numbers are centered on a mean ($\\mu$, mu) that is typically 0. If 0 is the center (mean), then you will get an equal number of random numbers above and below 0. The next question is how far these random numbers will venture from 0. In theory, you could end up with both positive and negative numbers close to the maximum positive and negative ranges supported by your computer. However, the reality is that you will more likely see random numbers that are between 0 and three standard deviations from the center.\n",
    "\n",
    "The standard deviation ($\\sigma$, sigma) parameter specifies the size of this standard deviation. For example, if you specified a standard deviation of 10, you would mainly see random numbers between -30 and +30, and the numbers nearer to 0 have a much higher probability of being selected.  \n",
    "\n",
    "The above figure illustrates that the center, which in this case is 0, will be generated with a 0.4 (40%) probability. Additionally, the probability decreases very quickly beyond -2 or +2 standard deviations. By defining the center and how large the standard deviations are, you can control the range of random numbers that you will receive.\n",
    "\n",
    "The Xavier weight initialization sets all weights to normally distributed random numbers. These weights are always centered at 0; however, their standard deviation varies depending on how many connections are present for the current layer of weights. Specifically, Equation 4.2 can determine the standard deviation:\n",
    "\n",
    "$$ Var(W) = \\frac{2}{n_{in}+n_{out}} $$\n",
    "\n",
    "The above equation shows how to obtain the variance for all weights. The square root of the variance is the standard deviation. Most random number generators accept a standard deviation rather than a variance. As a result, you usually need to take the square root of the above equation. Figure 3.XAVIER shows how this algorithm might initialize one layer. \n",
    "\n",
    "**Figure 3.XAVIER: Xavier Weight Initialization**\n",
    "![Xavier Weight Initialization](images/xavier_weight.png)\n",
    "\n",
    "We complete this process for each layer in the neural network.  \n",
    "\n",
    "### Manual Neural Network Calculation\n",
    "\n",
    "This section will build a neural network and analyze it down the individual weights. We will train a simple neural network that learns the XOR function. It is not hard to hand-code the neurons to provide an [XOR function](https://en.wikipedia.org/wiki/Exclusive_or); however, we will allow Keras for simplicity to train this network for us. The neural network is small, with two inputs, two hidden neurons, and a single output. We will use 100K epochs on the ADAM optimizer. This approach is overkill, but it gets the result, and our focus here is not on tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cycle #1\n",
      "[[0.49999997]\n",
      " [0.49999997]\n",
      " [0.49999997]\n",
      " [0.49999997]]\n",
      "Cycle #2\n",
      "[[0.33333334]\n",
      " [1.        ]\n",
      " [0.33333334]\n",
      " [0.33333334]]\n",
      "Cycle #3\n",
      "[[0.33333334]\n",
      " [1.        ]\n",
      " [0.33333334]\n",
      " [0.33333334]]\n",
      "Cycle #4\n",
      "[[0.]\n",
      " [1.]\n",
      " [1.]\n",
      " [0.]]\n"
     ]
    }
   ],
   "source": [
    "from tensorflow.keras.models import Sequential\n",
    "from tensorflow.keras.layers import Dense, Activation\n",
    "import numpy as np\n",
    "\n",
    "# Create a dataset for the XOR function\n",
    "x = np.array([\n",
    "    [0,0],\n",
    "    [1,0],\n",
    "    [0,1],\n",
    "    [1,1]\n",
    "])\n",
    "\n",
    "y = np.array([\n",
    "    0,\n",
    "    1,\n",
    "    1,\n",
    "    0\n",
    "])\n",
    "\n",
    "# Build the network\n",
    "# sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)\n",
    "\n",
    "done = False\n",
    "cycle = 1\n",
    "\n",
    "while not done:\n",
    "    print(\"Cycle #{}\".format(cycle))\n",
    "    cycle+=1\n",
    "    model = Sequential()\n",
    "    model.add(Dense(2, input_dim=2, activation='relu')) \n",
    "    model.add(Dense(1)) \n",
    "    model.compile(loss='mean_squared_error', optimizer='adam')\n",
    "    model.fit(x,y,verbose=0,epochs=10000)\n",
    "\n",
    "    # Predict\n",
    "    pred = model.predict(x)\n",
    "    \n",
    "    # Check if successful.  It takes several runs with this \n",
    "    # small of a network\n",
    "    done = pred[0]<0.01 and pred[3]<0.01 and pred[1] > 0.9 \\\n",
    "        and pred[2] > 0.9 \n",
    "    print(pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.], dtype=float32)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pred[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output above should have two numbers near 0.0 for the first and fourth spots (input [0,0] and [1,1]).  The middle two numbers should be near 1.0 (input [1,0] and [0,1]).  These numbers are in scientific notation.  Due to random starting weights, it is sometimes necessary to run the above through several cycles to get a good result.\n",
    "\n",
    "Now that we've trained the neural network, we can dump the weights.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0B -> L1N0: 1.3025760914331386e-08\n",
      "0B -> L1N1: -1.4192625741316078e-08\n",
      "L0N0                   -> L1N0 = 0.659289538860321\n",
      "L0N0                   -> L1N1 = -0.9533336758613586\n",
      "L0N1                   -> L1N0 = -0.659289538860321\n",
      "L0N1                   -> L1N1 = 0.9533336758613586\n",
      "1B -> L2N0: -1.9757269598130733e-08\n",
      "L1N0                   -> L2N0 = 1.5167843103408813\n",
      "L1N1                   -> L2N0 = 1.0489506721496582\n"
     ]
    }
   ],
   "source": [
    "# Dump weights\n",
    "for layerNum, layer in enumerate(model.layers):\n",
    "    weights = layer.get_weights()[0]\n",
    "    biases = layer.get_weights()[1]\n",
    "    \n",
    "    for toNeuronNum, bias in enumerate(biases):\n",
    "        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')\n",
    "    \n",
    "    for fromNeuronNum, wgt in enumerate(weights):\n",
    "        for toNeuronNum, wgt2 in enumerate(wgt):\n",
    "            print(f'L{layerNum}N{fromNeuronNum} \\\n",
    "                  -> L{layerNum+1}N{toNeuronNum} = {wgt2}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you rerun this, you probably get different weights.  There are many ways to solve the XOR function.\n",
    "\n",
    "In the next section, we copy/paste the weights from above and recreate the calculations done by the neural network.  Because weights can change with each training, the weights used for the below code came from this:\n",
    "\n",
    "```\n",
    "0B -> L1N0: -1.2913415431976318\n",
    "0B -> L1N1: -3.021530048386012e-08\n",
    "L0N0 -> L1N0 = 1.2913416624069214\n",
    "L0N0 -> L1N1 = 1.1912699937820435\n",
    "L0N1 -> L1N0 = 1.2913411855697632\n",
    "L0N1 -> L1N1 = 1.1912697553634644\n",
    "1B -> L2N0: 7.626241297587034e-36\n",
    "L1N0 -> L2N0 = -1.548777461051941\n",
    "L1N1 -> L2N0 = 0.8394404649734497\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0\n",
      "1.2\n",
      "0\n",
      "1.2\n",
      "0.96\n",
      "0.96\n"
     ]
    }
   ],
   "source": [
    "input0 = 0\n",
    "input1 = 1\n",
    "\n",
    "hidden0Sum = (input0*1.3)+(input1*1.3)+(-1.3)\n",
    "hidden1Sum = (input0*1.2)+(input1*1.2)+(0)\n",
    "\n",
    "print(hidden0Sum) # 0\n",
    "print(hidden1Sum) # 1.2\n",
    "\n",
    "hidden0 = max(0,hidden0Sum)\n",
    "hidden1 = max(0,hidden1Sum)\n",
    "\n",
    "print(hidden0) # 0\n",
    "print(hidden1) # 1.2\n",
    "\n",
    "outputSum = (hidden0*-1.6)+(hidden1*0.8)+(0)\n",
    "print(outputSum) # 0.96\n",
    "\n",
    "output = max(0,outputSum)\n",
    "\n",
    "print(output) # 0.96"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3.9 (tensorflow)",
   "language": "python",
   "name": "tensorflow"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
